IllusionVQA

A Challenging Optical Illusion Dataset for Vision Language Models

¹Bangladesh University of Engineering and Technology, ²University of California, Los Angeles, ³University of California, Riverside
*Equal Contribution
Figure (spider graph): Performance of state-of-the-art VLMs and human evaluators on the IllusionVQA-Comprehension task.

Figure (bar chart): Performance of state-of-the-art VLMs and human evaluators on the IllusionVQA-Soft-Localization task.

Abstract

The advent of Vision Language Models (VLMs) has allowed researchers to investigate the visual understanding of neural networks using natural language. Beyond object classification and detection, VLMs are capable of visual comprehension and common-sense reasoning. This naturally leads to the question: how do VLMs respond when the image itself is inherently unreasonable?

To this end, we present IllusionVQA: a diverse dataset of challenging optical illusions and hard-to-interpret scenes that tests the capability of VLMs on two distinct multiple-choice Visual Question Answering (VQA) tasks: comprehension and soft localization.

GPT-4V, the best-performing VLM, achieves 62.99% accuracy (4-shot) on the comprehension task and 49.7% on the localization task (4-shot and Chain-of-Thought). Human evaluators achieve 91.03% and 100% accuracy on comprehension and localization, respectively. We find that In-Context Learning (ICL) and Chain-of-Thought reasoning substantially degrade the performance of Gemini-Pro on the localization task. Tangentially, we discover a potential weakness in the ICL capabilities of VLMs: they fail to locate optical illusions even when the correct answer is in the context window as a few-shot example.
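To make the few-shot (ICL) setup concrete, here is a minimal sketch of how a 4-shot multimodal prompt is typically assembled, with demonstration images and question-answer pairs interleaved before the query. This illustrates the general technique only; the exact prompts used in the paper are not reproduced here, and the content-part structure below is an illustrative assumption rather than any specific vendor API.

```python
def build_few_shot_prompt(demos, query_image, query_question):
    """Interleave k demonstration (image, question, answer) triples before
    the query -- the standard few-shot setup for VLMs (k = 4 for "4-shot").

    The dict structure of each content part is illustrative; adapt it to
    the message format of the VLM API you are calling.
    """
    parts = []
    for image, question, answer in demos:
        parts.append({"type": "image", "image": image})
        parts.append({"type": "text", "text": f"{question}\nAnswer: {answer}"})
    # The query instance comes last, with no answer attached.
    parts.append({"type": "image", "image": query_image})
    parts.append({"type": "text", "text": query_question})
    return parts
```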

Comparison

Comparison of IllusionVQA with prior illusion datasets, GVIL and HallusionBench.

IllusionVQA Dataset

Overview

IllusionVQA is a Visual Question Answering (VQA) dataset with two sub-tasks. The first task tests comprehension on 435 instances across 12 optical illusion categories. Each instance consists of an image containing an optical illusion, a question, and 3 to 6 options, exactly one of which is correct. We refer to this task as IllusionVQA-Comprehension. The second task tests how well VLMs can differentiate geometrically impossible objects from ordinary objects when two objects are presented side by side. It consists of 1,000 instances following a similar format to the first task. We refer to this task as IllusionVQA-Soft-Localization.
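As a minimal sketch, an instance from either sub-task can be formatted into a multiple-choice prompt as follows. The Hugging Face repo ID and the field names (`image`, `question`, `options`, `answer`) are assumptions for illustration; consult the dataset card for the actual schema.

```python
from datasets import load_dataset

# Assumed repo ID and schema -- check the dataset card before relying on these.
ds = load_dataset("csebuetnlp/illusionVQA-Comprehension", split="test")

def build_mcq_prompt(example):
    """Format one instance (image + question + 3-6 options) as a
    multiple-choice question with lettered options."""
    letters = "ABCDEF"
    lines = [example["question"]]
    for letter, option in zip(letters, example["options"]):
        lines.append(f"{letter}. {option}")
    lines.append("Answer with the letter of the correct option.")
    return "\n".join(lines)

example = ds[0]
prompt = build_mcq_prompt(example)   # text part of the VLM query
illusion_image = example["image"]    # image part of the VLM query
print(prompt)
```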

Examples

Figure: One example of each type of illusion in IllusionVQA-Comprehension.

Comprehension Task

Sample questions from IllusionVQA-Comprehension.

Soft Localization Task

Sample questions from IllusionVQA-Soft-Localization.

Leaderboard

| # | Model | Source | Date | ALL | Impossible Object | Real-Scene | Size | Hidden | Deceptive Design | Angle Illusion | Color | Edited-Scene | Upside-Down | Pos.-Neg. Space | Circle-Spiral | Miscellaneous |
|---|-------|--------|------|-----|-------------------|------------|------|--------|------------------|----------------|-------|--------------|-------------|-----------------|---------------|---------------|
| - | Human Performance* | Link | 2024-02-25 | 91.03 | 98.51 | 98.44 | 63.04 | 100 | 94.59 | 84.62 | 60.87 | 100 | 100 | 100 | 66.67 | 89.47 |
| 1 | GPT-4V (4-shot) | Link | 2024-02-25 | 62.99 | 58.96 | 54.69 | 69.57 | 46.67 | 72.97 | 84.62 | 82.61 | 80.95 | 71.43 | 85.71 | 33.33 | 42.11 |
| 2 | GPT-4V (0-shot) | Link | 2024-02-25 | 58.85 | 55.22 | 57.81 | 58.70 | 51.11 | 70.27 | 69.23 | 69.57 | 71.43 | 71.43 | 57.14 | 50 | 42.11 |
| 3 | Gemini (4-shot) | Link | 2024-02-25 | 52.87 | 56.72 | 46.88 | 52.17 | 48.89 | 67.56 | 50 | 17.39 | 66.67 | 57.14 | 71.43 | 33.33 | 57.89 |
| 4 | Gemini (0-shot) | Link | 2024-02-25 | 51.26 | 56.72 | 46.88 | 45.65 | 42.22 | 64.86 | 53.85 | 17.39 | 66.67 | 57.14 | 85.71 | 33.33 | 52.63 |
| 5 | LLaVA (0-shot) | Link | 2024-02-25 | 40 | 43.28 | 42.19 | 19.57 | 42.22 | 43.24 | 38.46 | 26.09 | 61.90 | 71.43 | 42.86 | 0.00 | 42.11 |
| 6 | CogVLM (0-shot) | Link | 2024-02-25 | 38.16 | 44.03 | 34.38 | 13.04 | 42.22 | 45.95 | 30.77 | 30.43 | 42.86 | 71.43 | 71.43 | 16.67 | 42.11 |
| 7 | InstructBLIP (0-shot) | Link | 2024-02-25 | 34.25 | 34.22 | 26.56 | 26.09 | 44.44 | 37.84 | 30.77 | 30.43 | 42.86 | 42.86 | 57.41 | 33.33 | 36.84 |

The columns represent the different types of illusions in the dataset.
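For reference, the per-category numbers above are accuracies; a generic way to compute them from model predictions is sketched below. This is not the official evaluation script, and the record fields (`category`, `prediction`, `answer`) are assumptions for illustration.

```python
from collections import defaultdict

def per_category_accuracy(records):
    """Compute accuracy (%) per illusion category plus an overall 'ALL'
    entry, mirroring the leaderboard columns.

    records: iterable of dicts with 'category', 'prediction', 'answer'.
    """
    correct, total = defaultdict(int), defaultdict(int)
    for r in records:
        for key in (r["category"], "ALL"):
            total[key] += 1
            correct[key] += int(r["prediction"] == r["answer"])
    return {k: round(100.0 * correct[k] / total[k], 2) for k in total}

# Toy example with two records:
print(per_category_accuracy([
    {"category": "Color", "prediction": "A", "answer": "A"},
    {"category": "Hidden", "prediction": "B", "answer": "C"},
]))  # {'Color': 100.0, 'ALL': 50.0, 'Hidden': 0.0}
```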

🚨 To submit your results to the leaderboard, please send your result JSON files to this email.

🚨 For more submission details, please refer to this link.

Experiment Results

Data Visualization

Coming soon!

BibTeX

@misc{shahgir2024illusionvqa,
      title={IllusionVQA: A Challenging Optical Illusion Dataset for Vision Language Models}, 
      author={Haz Sameen Shahgir and Khondker Salman Sayeed and Abhik Bhattacharjee and Wasi Uddin Ahmad and Yue Dong and Rifat Shahriyar},
      year={2024},
      eprint={2403.15952},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}